(54) Computer system

(57) A computer system comprises: a processing system (10, 12) for processing data; a memory (14) for storing data processed by, or to be processed by, the processing system; a memory access controller (16) for controlling access to the memory; and at least one data buffer (40) for buffering data to be written to or read from the memory. A burst controller (32) is provided for issuing burst instructions to the memory access controller, and the memory access controller is responsive to such a burst instruction to transfer a plurality of data words between the memory and the data buffer in a single memory transaction. A burst instruction queue (30) is provided so that such a burst instruction can be made available for execution by the memory access controller immediately after a preceding burst instruction has been executed. Each such burst instruction includes or is associated with a parameter defining a spacing between locations in the memory to be accessed in response to that burst instruction, and the memory access controller is responsive to such a burst instruction to transfer a plurality of data elements between the memory, at locations spaced in accordance with the spacing parameter, and the data buffer in a single memory transaction. The system is particularly applicable for processing media data which has high spatial locality and regularity, but low temporal locality, and enables high performance to be extracted from cheap memory.

Description

This invention relates to computer systems, and in particular, but not exclusively, to such systems for processing media data.

An optimal computer architecture is one which meets its performance requirements whilst achieving minimum cost. In a media-intensive appliance system, at present the main hardware cost contributor is memory. The memory must have enough capacity to hold the media data and provide enough access bandwidth in order that the computation throughput requirements can be met. Such an appliance system needs to maximise the data throughput, as opposed to a normal processor which usually has to maximise the instruction throughput. The present invention is concerned in particular, but not exclusively, with extracting high performance from low cost memory, given the restraints of processing media-intensive algorithms.

The present invention relates in particular to a computer system of the type comprising: a processing system for processing data; a memory (provided for example by dynamic RAM ("DRAM")) for storing data processed by, or to be processed by, the processing system; a memory access controller for controlling access to the memory; and a data buffer (provided for example by static RAM ("SRAM")) for buffering data to be written to or read from the memory.

At present the cheapest form of symmetric read-write memory is DRAM. (By symmetric, it is meant that read and write accesses take identical times, unlike reads and writes with Flash memory.) DRAM is at present used extensively in personal computers as the main memory, with faster (and more expensive) technologies such as SRAM being used for data buffers or caches closer to the processor. In a low cost system, there is a need to use the lowest cost memory that permits the performance (and power) goals to be met. In the making of the present invention, an analysis has been performed of the cheapest DRAM technologies in order to understand the maximum data bandwidths which could be obtained, and it is clear that existing systems are not utilising the available bandwidth. The present invention is concerned with increasing the use of the available bandwidth and therefore increasing the overall efficiency of the memory in such a computer system and in similar systems.

A typical processor can access SRAM cache in 10ns. However, an access to main DRAM memory may take 200ns in an embedded system, where memory cost needs to be minimised, which is a twentyfold increase. Thus, in order to ensure high throughput, it is necessary to place as much data as possible in the local cache memory block before it is needed. Then, the processor only sees the latency of access to the fast, local cache memory, rather than the longer delay to main memory.

"Latency" is the time taken to fetch a datum from memory. It is of paramount concern in systems which are "compute-bound", i.e. where the performance of the system is dictated by the processor. The large factor between local and main memory speed may cause the processing to be determined by the performance of the memory system. This case is "bandwidth-bound" and is ultimately limited by the bandwidth of the memory system. If the processor goes fast enough compared to the memory, it may generate requests at a faster rate than the memory can satisfy. Many systems today are crossing from being compute-bound to being bandwidth-bound.

Using faster memory is one technique for alleviating the performance problem. However, this adds cost. An alternative approach is to recognise that existing memory chips are used inefficiently and to evolve new methods to access this memory more efficiently.

A feature of conventional DRAM construction is that it enables access in "bursts". A DRAM comprises an array of memory locations in a square matrix. To access an element in the array, a row must first be selected (or "opened"), followed by selection of the appropriate column. However, once a row has been selected, successive accesses to columns in that row may be performed by just providing the column address. The concept of opening a row and performing a sequence of accesses local to that row is called a "burst".

The term "burst efficiency" used in this specification is a measure of the ratio of (a) the minimum access time to the DRAM to (b) the average access time to the DRAM. A DRAM access involves one long access and (n-1) shorter accesses in order to burst n data items. Thus, the longer the burst, the more reduced the average access time (and so, the higher the bandwidth). Typically, a cache-based system (for reasons of cache architecture and bus width) will use bursts of four accesses. This relates to about 25 to 40% burst efficiency. For a burst length of 16 to 32 accesses, the efficiency is about 80%, i.e. about double.
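The definition of burst efficiency above can be turned into a short calculation. The following is a minimal sketch only: the function name and the timings (a 7-cycle row-opening access followed by 1-cycle column accesses) are illustrative assumptions and are not figures taken from this specification.

```python
def burst_efficiency(n, first_access, next_access):
    """Ratio of the minimum access time to the average access time
    for a burst of n items: one long access plus (n - 1) short ones."""
    if n < 1:
        raise ValueError("burst length must be at least 1")
    average = (first_access + (n - 1) * next_access) / n
    return next_access / average


# Hypothetical timings (assumed, not from the text): the first access of a
# burst costs 7 cycles, every following access in the open row costs 1 cycle.
for n in (4, 16, 32):
    print(n, round(burst_efficiency(n, first_access=7, next_access=1), 2))
```

With these assumed timings a burst of four gives 0.4, and bursts of 16 and 32 give roughly 0.73 and 0.84, which is in line with the ranges quoted in the text.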

The term "saturation efficiency" used in this specification is a measure of how frequently there is traffic on the DRAM bus. In a processor-bound system, the bus will idle until there is a cache miss and then there will be a 4-access burst to fetch a new cache line. In this case, latency is very important. Thus, there is low saturation efficiency because the bus is being used rarely. In a test on one embedded system, a saturation efficiency of 20% was measured. Thus, there is an opportunity of obtaining up to a fivefold increase in performance from the bus.

Combining the possible increases in burst efficiency and saturation efficiency, it may be possible to obtain about a 
tenfold improvement in throughput for the same memory currently used. 

A first aspect of the present invention is characterised by: means for issuing burst instructions to the memory access controller, the memory access controller being responsive to such a burst instruction to transfer a plurality of data words between the memory and the data buffer in a single memory transaction; and means for queueing such burst instructions so that such a burst instruction can be made available for execution by the memory access controller immediately after a preceding burst instruction has been executed.

A second aspect of the invention is characterised by: means for issuing burst instructions to the memory access controller, each such burst instruction including or being associated with a parameter defining a spacing between locations in the memory to be accessed in response to that burst instruction, and the memory access controller being responsive to such a burst instruction to transfer a plurality of data elements between the memory, at locations spaced in accordance with the spacing parameter, and the data buffer in a single memory transaction.
A third aspect of the invention provides a method of operating a computer system as indicated above, comprising: identifying in source code computational elements suitable for compilation to, and execution with, the at least one data buffer; transforming the identified computational elements in the source code to a series of operations each involving a memory transaction no larger than the size of the at least one data buffer, and expressing such operations as burst instructions; and executing the source code by the processing system, wherein the identified computational elements are processed by the processing system through accesses to the at least one data buffer.

Other preferred features of the invention are defined in the appended claims.

The present invention is particularly, but not exclusively, applicable only for certain classes of algorithm, which will be termed "media-intensive" algorithms. By this, it is meant an algorithm employing a regular program loop which accesses long arrays without any data dependent addressing. These algorithms exhibit high spatial locality and regularity, but low temporal locality. The high spatial locality and regularity arises because, if array item n is used, then it is highly likely that array item n+s will be used, where s is a constant stride between data elements in the array. The low temporal locality is due to the fact that an array item n is typically accessed only once.

Ordinary caches are predominantly designed to exploit high temporal locality by keeping data that is used often close to the processor. Spatial locality is exploited, but only in a very limited way by the line fetch mechanism. This is normally unit stride and relatively short. These two reasons mean that caches are not very good at handling media data streams. In operation, redundant data often replaces useful data in the cache and the DRAM bandwidth is not maximised. It is believed that traditional caches are ideally suited to certain data types, but not media data.

The main difference between the burst buffering of the invention and traditional caches is the fill policy, i.e. when (the first aspect of the invention) and how (the second aspect of the invention) to fill/empty the contents of the buffer.

^ nl^^Snclwrtlire invention. ^^^^^^ 

may augment (i.e. sit alongside) a traditional data cache and may be used for accessing, in particular but not exclusively, media data. The use of DRAM or the like can then be optimised by exploiting the media data characteristics, and the data cache can operate more effectively on other data types, typically used for control. It also appears that the data cache size may be reduced, as the media data is less likely to cause conflicts with the data in the cache, without sacrificing performance. Possibly it may prove to be the case that the total additional memory required for the burst buffers is of the same magnitude as the savings in memory required for the data cache.

A system may contain several burst buffers. Typically, each burst buffer is allocated to a respective data stream. Since algorithms have a varying number of data streams, it is proposed to have a fixed amount of SRAM available to the burst buffers. This amount may be divided up into equal sized amounts according to the number of buffers required. For example, if the amount of fixed SRAM is 2 kByte, and if an algorithm has four data streams, the memory region might be partitioned into four 512 Byte burst buffers. Another algorithm with six streams could be supported by dividing the memory into eight burst buffers each of 256 Bytes in size. In other words, where the number of data streams is not a power of two, the number of burst buffers is preferably the nearest higher power of two.
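The partitioning rule just described (round the number of streams up to a power of two, then split the fixed SRAM equally) can be sketched as follows. The function name is hypothetical and the code is an illustration of the rule rather than a description of any particular hardware.

```python
def partition_burst_buffers(total_bytes, num_streams):
    """Return (number_of_buffers, bytes_per_buffer) for a fixed SRAM area.

    The number of buffers is the smallest power of two that is not less
    than the number of data streams; the SRAM is split equally among them.
    """
    if num_streams < 1:
        raise ValueError("at least one data stream is required")
    count = 1
    while count < num_streams:
        count *= 2
    if count > total_bytes:
        raise ValueError("too many streams for the available SRAM")
    return count, total_bytes // count


# The two cases given in the text, for 2 kByte of SRAM.
print(partition_burst_buffers(2048, 4))  # four streams
print(partition_burst_buffers(2048, 6))  # six streams
```

For 2 kByte of SRAM this yields four 512 Byte buffers for four streams and eight 256 Byte buffers for six streams, matching the example in the text.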

In architectures according to the invention, a burst comprises the set of addresses defined by:

$$\text{burst} = \{\, B + S \times i \mid B, S, i \in \mathbb{N} \wedge 0 \le i < L \,\}$$

where B is the base address of the transfer, S is the stride between elements, L is the length and N is the set of natural numbers. Though not explicitly defined in this equation, the burst order is defined by i incrementing from 0 to L-1. Thus, a burst may be defined by the 3-tuple of:

(base_address, length, stride)

A burst may also be defined by the element size. This implies that a burst may be sized in bytes, half-words or words. The units of stride must take this into account. A "sized-burst" is defined by a 4-tuple of the form:

for the mapping of software sized-bursts into channel-bursts. The channel-burst may be defined by the 4-tuple:
(base_address, length, stride, width)
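The set definition of a burst, with the burst order given by i incrementing from 0 to L-1, maps directly onto a few lines of code. This is a minimal sketch; the function name is hypothetical and the addresses in the example are arbitrary.

```python
def burst_addresses(base_address, length, stride):
    """Addresses of a burst (base_address, length, stride), in burst order.

    Element i of the burst is at base_address + stride * i for
    i = 0, 1, ..., length - 1.
    """
    if base_address < 0 or length < 0 or stride < 0:
        raise ValueError("base, length and stride must be natural numbers")
    return [base_address + stride * i for i in range(length)]


# Example: four elements starting at 0x1000, eight address units apart.
print([hex(a) for a in burst_addresses(0x1000, 4, 8)])
```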

If the channel width is 32 bits (or 4 bytes), the channel-burst is always of the form:

(base_address, length, stride, 4)

or abbreviated to the 3-tuple (base_address, length, stride).

accompanying drawings, in which: 

Figure 1 shows a computer system in accordance with the present invention;
Figures 2A to 2D show the burst buffer memory and burst instruction queue of the system;
Figure 3 is a schematic diagram of a second embodiment of a computer system in accordance with the present invention;
Figure 4 is a schematic diagram of the buffer controller of Figure 3;
(the remaining figure descriptions, which refer to the architecture of Figure 3 and to a system according to the invention, are illegible in the source).

„e«.«».=o,.1,ma».^"sy=«n.«»J,':'ra»£:=,»^^^ 
memo,, 14 euoh as EDO DRAM, a ™;" J^^"" '^"'^s^cessor ,0; an SRAM «a o«;ha 19 

ra ria'sr^rr=v=:t^^«--^^^^ 
sr=J.°u"::sSn"ss^rrr.::^?^3/-.sopro.^^ 

line in the drawing.

The burst buffer system 24 includes: (1) a processor interface 12, for example a coprocessor for the processor 10; (2) a burst buffer memory 26 provided by a fixed amount of SRAM, for example 2 kBytes; (3) a range comparator 28 which can accept memory requests (for example from the processor 10) and determine whether the required data is resident in the burst buffer memory 26 or whether to initiate an access to the main memory 14 to fetch the data; (4) a burst instruction queue 30 in the form of at least one FIFO, which can receive burst instructions from the processor 10; (5) a burst controller 32 that can evaluate the current system status, extract the next relevant burst instruction from the burst instruction queue 30 and issue that instruction or request to the main memory 14; (6) a parameter store 34 which holds parameters relating to burst transfers and which may be updated by some specific burst instructions; (7) data paths 36a to 36d for the movement of data between the burst buffer memory 26 and the processor 10 and main memory 14, including a path bypassing the burst buffer memory 26 for missed data;

example as a single 2 KByte buffer 40(0). as shown in ^J; «^ %P^/^ ^ ^ Byte buffers 40(0) to 40(7). 
Z^eZB: as four 512 Byte buffer 40(0) ^^^^l^J^lX^Sif^Ll the'main memory 14 to the processor 
as shown in Figure 2D. Also, each buffer may be ^^^"9«^^^^^^;; f^om the processor 10 to the mam memory 
10 (for example as for buffers 40(2). ^f^^^'^^^'^^^^^ 'or as at«nal buffer (for example as for buffer 40(0) in 
14 for example as for buffers 40(0). ^^^^^ " ^'9" ^^so Stf same number of FIFOs 42 as the number of 

K^.r:?sr^'"f32Hs^rcrn4*rar^^^^ 
-~c°oSrircrr:rr:s^ror^^^^ 
eters, takes the form:

ld (g5),r4

This instructs the processor to fetch the data word pointed to by the address in its register g5 and to place that data word in its register r4. However, in one embodiment of the invention, the instruction set is extended to include an equivalent "loadburst" instruction, which, with its parameters, takes the form:

loadburst src,stride,size,buf

This causes a burst of size data words to be transferred from the memory 14 to that one of the input or bidirectional burst buffers 40(buf) having the identity buf, beginning at address src in the memory 14, and with addresses incrementing by stride memory locations. There is also a corresponding "storeburst" instruction, which, with its parameters, takes the form:

storeburst buf,src,stride,size

This causes a burst of size data words to be transferred from that one of the output or bidirectional burst buffers 40(buf) having the identity buf to the memory 14, beginning at address src in the memory 14, and with addresses incrementing by stride memory locations.

The instructions loadburst and storeburst differ from normal load and store instructions in that they complete in a single cycle, even though the transfer has not occurred. In essence, the loadburst and storeburst instructions tell the memory interface 16 to perform the burst, but they do not wait for the burst to complete.
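The non-blocking behaviour of loadburst and storeburst can be illustrated with a small software model. This is a sketch under stated assumptions, not the hardware of the embodiment: the class name, the `service_one` method, the buffer count and the buffer depth are all hypothetical, and main memory is modelled as a plain list of words.

```python
from collections import deque


class BurstBufferModel:
    """Toy model: burst instructions are queued and return at once;
    the transfer itself happens later, when the queue is serviced."""

    def __init__(self, memory, num_buffers=4, buffer_words=16):
        self.memory = memory  # main memory modelled as a list of words
        self.buffers = [[0] * buffer_words for _ in range(num_buffers)]
        self.queue = deque()  # FIFO of pending burst instructions

    def loadburst(self, src, stride, size, buf):
        # Completes immediately: only records the request.
        self.queue.append(("load", src, stride, size, buf))

    def storeburst(self, buf, src, stride, size):
        # Same argument order as the storeburst form given in the text.
        self.queue.append(("store", src, stride, size, buf))

    def service_one(self):
        """Perform the oldest queued burst; return False if none is queued."""
        if not self.queue:
            return False
        op, addr, stride, size, buf = self.queue.popleft()
        for i in range(size):
            if op == "load":
                self.buffers[buf][i] = self.memory[addr + i * stride]
            else:
                self.memory[addr + i * stride] = self.buffers[buf][i]
        return True


memory = list(range(64))
model = BurstBufferModel(memory)
model.loadburst(0, 4, 4, 1)   # returns before any data has moved
model.service_one()           # now words 0, 4, 8, 12 are in buffer 1
print(model.buffers[1][:4])
```

The point of the model is the separation between issuing an instruction and performing it; a real memory interface would service the queue concurrently with the processor rather than on an explicit call.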

In the above system, the memory interface 16 must, within reason, be able to service burst requests with any size and stride. There must also be a high degree of coupling to the microprocessor 10, with the best solution being combined integration on the same chip. Memory requests from the processor 10 may be performed in several manners, two being: (a) using a memory-mapped register for the burst instruction queue 30; and (b) using a coprocessor interface to bypass the load/store mechanisms. The latter of these is preferred, but requires architectural features not always present in a processor. Using the latter model also requires the definition and use of new processor instructions.

One of the main advantages of a cache is that of transparent correctness. The correct data is always given to the processor and updated in main memory whenever appropriate, using hardware methods invisible to the processor. The burst buffer system 24 also provides similar functionality.

In the above system, the data in a burst buffer 40 is copied from a region of main memory 14. The location information (i.e. address, stride etc.) is compared against any memory request from the processor 10 to determine if it hits in the respective buffer 40. The comparison can be performed in a couple of ways: all addresses in the buffer 40 could be held and associatively compared by the range comparator 28 with the processor address (as for normal cache tags); and an equation specifying the addresses in the buffer can be examined by the range comparator 28 using the processor address to see if it is a solution. The former is expensive (and gets even more expensive for higher speed) whereas the latter is cheap and fast, but restricts the stride to powers of two to obtain satisfactory performance.
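The second comparison method, checking whether a processor address is a solution of the equation describing the buffer contents, can be sketched as below. Both function names are hypothetical. The second function illustrates why a power-of-two stride is attractive: the division and remainder reduce to a mask and a shift.

```python
def hits_burst(address, base, stride, length):
    """True if address == base + stride * i for some 0 <= i < length."""
    offset = address - base
    if offset < 0 or stride <= 0:
        return False
    if offset % stride != 0:
        return False
    return offset // stride < length


def hits_burst_pow2(address, base, log2_stride, length):
    """Same test when the stride is 2**log2_stride: mask and shift only."""
    offset = address - base
    if offset < 0:
        return False
    mask = (1 << log2_stride) - 1
    return (offset & mask) == 0 and (offset >> log2_stride) < length


# A burst of 4 elements at 0x1000 with stride 8 holds 0x1000 .. 0x1018.
print(hits_burst(0x1010, 0x1000, 8, 4))       # address of element 2
print(hits_burst_pow2(0x1004, 0x1000, 3, 4))  # falls between elements
```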

A read hits in a buffer 40 if the address range comparison is true. In this case the datum is returned very quickly to the processor from the buffer. On the other hand, a read miss causes the required datum to be extracted from main memory 14 directly, bypassing the burst buffer memory 26. However, if the datum is in a range that is currently being loaded, the read is "stalled" or "blocked" until the range is loaded and then it is extracted from the buffer 40 and passed to the processor 10. (In a modification, the datum would be passed on as soon as it was received to save latency.) If the datum were in a burst that is due to be issued, then the read may again be blocked until the burst is performed in order to prevent the datum being read twice in close succession.

A write hit causes the datum in the respective buffer 40 to be updated. The main memory 14 is not updated at that time, but coherency with the main memory 14 is achieved under software control by performing a storeburst sometime later. On the other hand, a write miss causes the datum to be updated in the main memory 14 directly unless a storeburst is pending or active containing the same datum. In this case the write is blocked until after the storeburst has completed.

The burst controller 32 for issuing instructions to the memory interface 16 may use a mechanism which will be termed "deferral". This means that the time at which the instruction is issued is deferred until some later time or event. For example, if the next instruction were a storeburst-deferred-16access, it would wait until 16 accesses into the burst buffer had been completed, and then automatically issue the store. Other deferral mechanisms may be based on: time (i.e. count cycles); events such as external interrupts; and buffer full/empty indicators. Using deferral on access count is a powerful feature of the burst buffer system 24 because it allows decoupling of the program flow and the issuance of instructions to the memory interface 16.
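Deferral on access count can be modelled as a counter that releases a pending instruction after a set number of buffer accesses. The sketch below is illustrative only; the class and method names are hypothetical and the model ignores the other deferral triggers (time, interrupts, full/empty indicators) mentioned in the text.

```python
class DeferredStoreburst:
    """A pending storeburst that is issued after a given number of
    accesses to its burst buffer have been observed."""

    def __init__(self, deferral_count):
        self.remaining = deferral_count
        self.issued = False

    def on_buffer_access(self):
        """Call once per buffer access; returns True on the access
        that causes the storeburst to be issued."""
        if self.issued:
            return False
        self.remaining -= 1
        if self.remaining <= 0:
            self.issued = True
            return True
        return False


# A storeburst deferred by 16 accesses is issued on the 16th access.
pending = DeferredStoreburst(16)
fired_on = [i + 1 for i in range(20) if pending.on_buffer_access()]
print(fired_on)
```

The program simply keeps writing results into the buffer; the store to main memory is triggered by the access count rather than by an explicit instruction at that point in the program flow.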

The burst buffer controller 32 provides status information back to the processor 10 on path 38f. It also provides a unique buffer identifier so that software-controlled buffer allocation and logical buffer renaming may be implemented.

If the length of a burst is longer than the size of the respective buffer 40, one procedure is to truncate the burst length so that it is the same as that of the buffer 40. However, in a modified procedure, a much longer stream is brought






EP 0 862 118 A1

There now follows a description in greater detail of the operation of the burst buffer system 24.

■.J^S^^^^^S^-^- "te » lo*ed 1 .cc«s is prevented) Is cop«d «. n,em»,. The buffer 

35 is then invalidated. ^ i«x«„^^ Tiiic mpan<; that a count is associated with each 
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''"^ A buHer may be made valid by an "allocbuffer" Instruction. This instruction is similar to '°^"'^J"^*"f T,'" 
" '""^efe ^:rri:^T'::Z;:?Zrua^r.. -mis .mp-y invalidates the buffer leaving Hs contents unchanged. A 

"%1SSlte^Src,S^:SSa^e™p>,«.hec«e,,n,,,ed 

» ■^„^s:r',;ns^;«^x.*r«*e^d..-aK»^^ 

of buffer, required b, e part«.u1er epp«eafl»> or s,e«m «.! very depe^^^^ 

number of streams that need to be supported, etc. 
handled just like any other loadburst instruction _ conditions, it may be brought into

Provided the application can guarantee ttiat a J^^J^^^J ""^fstream into bursts. The conditions are: (1) each 
memory automatically with the "memory contron^decompoar^g ^J''^ ^ accordance with a pre- 

dala element is brought in sequentially; (2) each ^ata elemert .s eJer^^V o^^^^ ^^^^^ 
determined usagepattern; (3, pr^e^mg^^^^^^^^^ 

:;r.srre=r^^^^^^ 
riS,r:re2b?s^^^^ 

by the size of the buffer. ^ ♦„ ^wnrt a h..rst transferlts functionality is restricted in that is cannot termi- 

A "burstabort" command may be used to abort a he cancelled In any case, the buffer is 

>egi»« rannes are: 1 . base.a*lress; 2 Jenjlh. f ;„'"f"~ '/"^n^ the ooolrol rejiae. and stalus mtormalion 

.^rr=r„r»'rrrat^f^s:;fr::r.™^^ 

main memory given by the following equation: 

$$\text{Address}[i] = \text{base\_address} + (i - 1) \times \text{stride} \qquad \text{(Equation 1)}$$

where i ranges from 1 to length.

Before any burst instruction is issued, the base address register must be initialised.

.a.:er=a*i^rfSr^=r.raSr===.--™^ 
is never automatically changed, even for stream access^^ ^ ^^^^ .^^^^ gj,^^^. 

o.edT.vrret:r;r^^^^^ 

The base address is specified in bytes and must be word aligned. If it is not, the value is automatically truncated and no rounding is performed. The value read from the register is unaltered.

Regarding the length register, before any burst instruction is issued, the length register must be initialised. When a
For non-stream bursts, the length register

Vft«e. to Ihe terWh regiar when « buffer » "» ™^*'" eisequenHy invalidaled. 
hardware to ensure the. any '•^■•'^''■r'}';^;,^^^^^ Swe ill. Oe.VlWtaa). vau.s that are 

•""vSeir*:s^?re,^--.'---«rr'2^X^™^ 
aligned to word boundaries will be truncated and no rounding is performed. The value read from the register is unaltered.

aliases of writes to the control register. The data

allocbuffer: [000][29 reserved bits]

Because the data in a buffer remains in a buffer through invalidation, a buffer may be "remapped" by changing the

for this instruction is: 

burstabort: [001][29 reserved bits] 

A "freebuffer" instruction is used to invalidate the associated buffer. No parameters are used. The format is:

freebuffer: [010][29 unused bits]

data WO maio mamory. raspoewoly. Th. ?"«*,^'~ bura aaaoasos aco geoaralad acoorfing 

loadburst: [011][V][12 reserved bits][16 bit deferral_count]
storeburst: [100][V][12 reserved bits][16 bit deferral_count]
- whereVisava.uebita.ind.catesv.eth..^^^^^^ 
i^^aTciuraccrSa;^^^^^^^^^ 

-r^ss^"Sd^^;;r^^^^^ 

complete data stream from the buffer, respe'*;^'^ JJ^' ^^^^^ the^Jam S^to a set of bursts that are transferred from 
to bytes. The buffer manager ^"t"^*'^"^ ^^^"^^^^ Si^ of the buffer is automatically co-ordinat^ 

memory to the buffer and to memory ^^.^.-f^^^^^^S cSunt^^^^ that a buffer is replaced by the next 

by hardware. Burst boundaries ^^^J*^^'^^^^^^^ mechanism for progressing through the stream 

buffer in the sequence after a predefined ""'^f' J^"'^^^^^^ stream using another instruction. For a stream 
is available, but it IS possible to cOTSderme^a^^^ 

loadstream: [101][V][12 reserved bits][16 bit deferral_count]
storestream: [110][V][12 reserved bits][16 bit deferral_count]

where V is as defined above. The bottom 1 6 ^^l^^^!^^;;,^' ^f^^^^^i^ burst buffer. It is the only read com- 

.anS2^r^^:;=rrarr^^^^^ 
the current mapping may be obtained by reading the other registers. No other information is available.
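The command formats listed above (a 3-bit code, followed for the burst and stream commands by a V bit, 12 reserved bits and a 16-bit deferral count) add up to 32 bits and can be packed and unpacked as sketched below. The sketch assumes that the bracketed fields are written from the most significant bit downwards, which the source does not state explicitly; the function names are hypothetical.

```python
# 3-bit command codes as listed in the text.
OPCODES = {
    "allocbuffer": 0b000,
    "burstabort": 0b001,
    "freebuffer": 0b010,
    "loadburst": 0b011,
    "storeburst": 0b100,
    "loadstream": 0b101,
    "storestream": 0b110,
}
# Commands whose remaining 29 bits are reserved or unused.
NO_PARAMETERS = {"allocbuffer", "burstabort", "freebuffer"}


def encode_command(name, v=0, deferral_count=0):
    """Pack a command word: code in bits 31:29, V in bit 28 and the
    deferral count in bits 15:0 (assumed MSB-first field order)."""
    if v not in (0, 1) or not 0 <= deferral_count <= 0xFFFF:
        raise ValueError("field out of range")
    if name in NO_PARAMETERS and (v or deferral_count):
        raise ValueError(name + " takes no parameters")
    return (OPCODES[name] << 29) | (v << 28) | deferral_count


def decode_command(word):
    """Unpack a 32-bit command word into (name, v, deferral_count)."""
    names = {code: name for name, code in OPCODES.items()}
    return names[(word >> 29) & 0x7], (word >> 28) & 0x1, word & 0xFFFF


word = encode_command("loadburst", v=1, deferral_count=16)
print(hex(word), decode_command(word))
```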

A second embodiment of a computer system in accordance with the present invention is shown in Figures 3 and 4.
.„ thfsl^iST^S^ ^ci^^^^^^ first embodiment is replaced by an interface based around a pair of 

and a buffer access table (BAT) describing regions of burst buffer memory. In this embodiment, a homogeneous

To r^ei^es cLn^inputs from eight burst control registers 52. lnfom«tion contained in these two tabl^ is bound 
together at run time to describe a complete main-memory-to-burst-buffer transaction. Outputs are provided from the
buffer controller to direct memory access (DMA) controller 56 and hence to memory datapath arbiter 58 to effect

^^^^nr:r;rerbuTs;;^^^^^^ 

tha aid ^Tbe used for rapid processor access. For this architecture to be advantageous, it 

Ton'^T^sL^v fo° aiSom^^^^^ me^ry 26 to be significantly faster than access from the .riain mem- 

rii-s«=re:i;^^^^ 

SncSSX w^lTcessl'S^^^ there are no appropriate interlocKs or priority mechanism, software 

needs to prevent write conflicts to the same SRAM location. ^ H«.manrfc ac«n- 

4«^RAM needs to be sized SO as to handle several streams in a computation, and to handle the demands asso- 
• , JT^iih^^iJ SSS It fS^ at for a large number of applications providing resources to handle eight streams is 
:SSe^ as « d^^i^ eS^^^^^ « is desirable to have two buffers: onefor incoming and outgoing 

i?eal and *eTh^t sS^^^^^ computation. This suggests finding room for 16 bursts. It is also found 

This results in 128 bytes per burst buffer, and 2 Kbytes of SRAM in total to accommodate 16 such buffers.

^s:rbXrrr;Se%?™^^^^^^^ 

The functionality associated with each bit of this register is set out in Table 1 below. 
read only

Table 1: Buffer control register definitions 
The version register is read-only.

The sync register is a read-only register that is used to stall the processor until the burst instruction queue becomes empty.
sync register read value    description
0x0                         no stall was required - the read completed immediately
0x1                         stall was required until a storeburst completed
0x3                         stall was required until a loadburst completed
0x5                         reserved for future use

Table 2: sync register read values
Capacity is also provided to allow a write to this register in exceptional circumstances: specifically, to permit an
-^te^rrrrsfe™-^^^^^^^^ 

controller. Figure 5 shows the structure and position of ^ controller. It is a read-only 

this ^iSf the burst instruction queue at a context swftch. Having disabled burst 
The queuetop register is useo ror empiy.na ^ ^ . be read, instruction by instructon untl 

instruction execution by wr.t.ng 0 to ---^^'^^^^^^^^^ not the instruction that is execut- 

°'^rre;th^;rrrh= 



Table 3 below. 
bit    | name                     | read/write | description                                                                                                                                              | default
0      | step                     | read/write | single step next instruction from burst instruction queue; writes: 0 = do nothing, 1 = step; reads: 0 = idling (instruction complete), 1 = step in progress | 0
15:1   | unused                   |            |                                                                                                                                                          | 0x0000
16     | buffer_overrun_warn      | read/write | buffer area overrun warning                                                                                                                              | 0
17     | stride_overrun_warn      | read/write | stride value warning                                                                                                                                     | 0
18     | memtab_modified_warn     | read/write | memory access table entry modified after burst instruction                                                                                               | 0
19     | buftab_modified_warn     | read/write | buffer access table entry modified after burst instruction issue warning                                                                                 | 0
20     | invalid_instruction_warn | read/write | invalid instruction detected and ignored by memory controller warning                                                                                    | 0
31:21  | unused                   |            |                                                                                                                                                          | 0x000

Table 3: debug register definitions
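The bit assignments of Table 3 lend themselves to a small decoding helper. The following is a sketch for illustration; the function name is hypothetical, and only the bit positions and flag names come from the table.

```python
# Bit position -> flag name, as listed in Table 3.
REGISTER_FLAGS = {
    0: "step",
    16: "buffer_overrun_warn",
    17: "stride_overrun_warn",
    18: "memtab_modified_warn",
    19: "buftab_modified_warn",
    20: "invalid_instruction_warn",
}


def decode_flags(value):
    """Return the names of all Table 3 flags that are set in value,
    in ascending bit order. Unused bits are ignored."""
    return [name for bit, name in sorted(REGISTER_FLAGS.items())
            if (value >> bit) & 1]


# Example: bits 16 and 20 set.
print(decode_flags((1 << 16) | (1 << 20)))
```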



The burst instruction queue 30 comprises as before a FIFO memory. Burst instructions are provided by the proc- 
essor: compilation of source code to this structure is described further below. In this embodiment, four fields are pro- 
vided in the burst instruction. These are: 

1. Instruction 

2. Auto-stride indicator for MAT (block_increment bit)

3. Index to entry in MAT used to control transfer 

4. Index to entry in BAT used to control transfer 

The fundamental operation is to issue an instruction which indexes to two table entries, one in each of the memory access and buffer access tables. The index to the memory access table retrieves the base address, extent and stride used at the memory end of the transfer. The index to the buffer access table retrieves the base address within the burst buffer memory region. It should be noted that the indices provided in the burst instruction are not in themselves address values in the embodiment described (though they may be in alternative embodiments). In the present embodiment, masking and offsets are provided to the index values by a context table, as will be discussed further below. The direct memory access (DMA) controller is passed the parameters from the two tables and uses them to specify the required transfer.

Two alternative formats are provided, as is indicated in Table 4 below: 
Field                        | Format A (bufcntl.swap = 0) | Format B (bufcntl.swap = 1)
instruction                  | bits 31:30                  | bits 31:30
block_increment indicator    | bit 29                      | bit 29
memory access table index    | bits 28:22 (upper field)    | bits 21:0 (lower field)
buffer access table index    | bits 21:0 (lower field)     | bits 28:22 (upper field)

Table 4: Instruction format options
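The two layouts of Table 4 differ only in which table index occupies the upper field (bits 28:22) and which occupies the lower field (bits 21:0). A minimal packing and unpacking sketch follows; the function names and the dictionary keys are hypothetical, while the bit positions are those of the table.

```python
def encode_burst_instruction(instruction, block_increment,
                             mat_index, bat_index, swap=0):
    """Pack the four fields of Table 4. swap=0 selects Format A (MAT
    index in the upper field), swap=1 selects Format B (BAT index in
    the upper field)."""
    upper, lower = (mat_index, bat_index) if swap == 0 else (bat_index, mat_index)
    if not (0 <= instruction < 4 and block_increment in (0, 1)
            and 0 <= upper < (1 << 7) and 0 <= lower < (1 << 22)):
        raise ValueError("field out of range for the selected format")
    return (instruction << 30) | (block_increment << 29) | (upper << 22) | lower


def decode_burst_instruction(word, swap=0):
    """Unpack a 32-bit burst instruction according to Table 4."""
    upper = (word >> 22) & 0x7F
    lower = word & 0x3FFFFF
    mat_index, bat_index = (upper, lower) if swap == 0 else (lower, upper)
    return {
        "instruction": (word >> 30) & 0x3,
        "block_increment": (word >> 29) & 0x1,
        "mat_index": mat_index,
        "bat_index": bat_index,
    }


word_a = encode_burst_instruction(1, 1, mat_index=5, bat_index=9, swap=0)
word_b = encode_burst_instruction(1, 1, mat_index=5, bat_index=9, swap=1)
print(hex(word_a), hex(word_b))
```

The same logical instruction therefore has two different encodings, and a decoder must know the current value of bufcntl.swap to interpret a word correctly.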



[...] whilst a value of 1 selects format B. The use of [...] a DMA to a different region [...] in view of the different [...].

[...] instruction indexes parameters in the MAT and BAT, which define [...] entry in the MAT [...].

[...] also indexes parameters in the MAT and BAT, again [...] of the indexed entry in [...].

Sync, and also Null, is achieved by setting bits 31 [...]. The main purpose of this instruction is to provide a synchronisation [...] and the burst instructions. Writing a sync instruction to the burst instruction [...] from entering the queue: it locks them out. This means that there can only be one [...] in the queue at any one time, and also that reading of a sync instruction indicates that the [...] a DMA access but does activate the synchronisation mechanisms associated with the sync register. [...] sync instructions is discussed further below.

The Memory Access Table (MAT) 65 will now be described with reference to Figure 4. This is a memory descriptor table [...]; different implementations are of course possible. Each entry comprises three fields:

1. [...] which would cause difficulties for the memory controller.

2. Extent (extent) [...]

3. Stride (stride) - the interval between successive elements in a transfer.
Each of the fields may be read as a normal memory mapped register. Each register is 32 bits wide, but only selected fields are writable, as indicated in Table 5 below:

memory access table register    writable field    comments

memaddr                         bits [31:0]       no restrictions
extent                          bits [10:0]       restricted to size of buffer area (2047)
stride                          bits [9:0]        restricted to 1023

Table 5: Writable fields within MAT registers

memaddr - This is the 32 bit unsigned, word-aligned address of the first element of the channel burst. Illegally aligned values may be automatically aligned by truncation. Reads of this register return the value used for the burst (so if truncation was necessary, the truncated value is returned).

extent - The parameter in the extent register is the address offset covering the range of the burst transfer. If the transfer requires L elements separated by a stride of S, then the extent is S*L. When a burst is executed by the memory controller, if this value plus the memaddr value is greater than the size of the buffer area, then a bufcntl.buffer_overrun_warn flag is set. The resultant burst wraps around to the beginning of the buffer area. The [...]

stride - [...] are restricted in the range of 1 to 1024. Values greater than 1024 are automatically truncated to 1024 and a bufcntl [...] overrun warning flag is set. Reads of this register return the value used for the burst (i.e. if truncation was necessary, then the truncated value is returned). Also, strides must be multiples of the memory bus width, which in this case is 4 bytes. Automatic truncation (without rounding) is performed to enforce this alignment. The default value provided is 0, equal to a stride length of 1.
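
The relationship between memaddr, extent and stride may be sketched as follows. This is a minimal illustration only: the names mat_entry, align_word, mat_make, mat_elements and mat_element_addr are hypothetical, all three quantities are assumed to be held as plain byte values (the register encodings and range limits of Table 5 are not modelled), and the alignment by truncation follows the 4-byte bus width stated above.

```c
#include <assert.h>
#include <stdint.h>

/* One MAT entry, with all quantities held as byte values (an assumption). */
typedef struct {
    uint32_t memaddr; /* address of the first element of the burst */
    uint32_t extent;  /* address offset covered by the burst: S * L */
    uint32_t stride;  /* interval between successive elements */
} mat_entry;

/* Truncate (never round) to a multiple of the 4-byte memory bus width. */
uint32_t align_word(uint32_t v)
{
    return v & ~(uint32_t)3u;
}

/* Build an entry for a transfer of `elements` items; stride must be >= 4. */
mat_entry mat_make(uint32_t memaddr, uint32_t elements, uint32_t stride)
{
    mat_entry e;
    e.memaddr = align_word(memaddr);
    e.stride = align_word(stride);
    assert(e.stride != 0);
    e.extent = e.stride * elements;
    return e;
}

/* Number of elements L implied by extent = S * L. */
uint32_t mat_elements(const mat_entry *e)
{
    assert(e->stride != 0);
    return e->extent / e->stride;
}

/* Memory address of the k-th element of the burst. */
uint32_t mat_element_addr(const mat_entry *e, uint32_t k)
{
    assert(k < mat_elements(e));
    return e->memaddr + k * e->stride;
}
```

Under these assumptions a transfer of 32 elements at a stride of 16 bytes has an extent of 512, and a misaligned start address or stride is silently truncated downwards before use.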

An example of values contained by a MAT slot might be:

[...]

which results in a 32 word (32 4-byte words) burst, with each word separated by 4 words (4 4-byte words).

The auto-stride indicator bit of a burst instruction also has relevance to the MAT 65. If this bit is set in the burst instruction, the start address entry is increased to point to the next memory location should the burst have continued past 32. This saves processor overhead in calculating the start address for the next burst in a long sequence of [...].

The buffer access table (BAT) 66 will now be described with reference to Figure 4. This is again a memory descriptor table, in this case holding information relating to the burst buffer memory area 26. Each entry in the BAT 66 describes a transaction to the burst buffer memory area 26. As for the MAT 65, the BAT 66 comprises 16 entries, though this can of course be varied as for the MAT 65. Each entry in this case comprises two fields:

1. Buffer address (bufaddr) - the start of the buffer in the buffer area

2. Buffer size (bufsize) - the size of the buffer area used at the last transfer

Again, each of the fields may be read and written as a normal memory mapped register. As for the MAT 65, each register is 32 bits wide, but only selected fields within the register are writable, as set out in Table 6 below. All unwritable bits are always read as zero.

buffer access table register    writable field    comments

bufaddr                         bits [10:0]       limited by buffer area
bufsize                         bits [10:0]       limited by buffer area - only written after a context switch

The remaining feature of the buffer controller [...] the corresponding slots to be [...] of [...] of an offset.

2. Memory mask (memmask) - this is the mask applied [...] (see below). This [...]

[...] offset added to the BAT [...] of a burst instruction after masking (see below). [...]

This feature of the [...] MAT and BAT to be defined, desirable as more [...] the burst instruction results in [...] then [...] produces a pattern of the form 2 3 0 1 2 3 0 1 2 3 [...] registers 52 and elsewhere in the buffer controller [...] maximum value is 16. Higher [...] maximum value permitted is 15. This [...] ignored.

[...] BAT 66. As the table size is 16, the maximum value is 16. Higher values may automatically be truncated to 16, and negative values may be permitted. [...] ignored.

The DMA controller 56 therefore receives from the buffer controller [...] an attached main memory address, stride and transfer length from MAT 65 and a [...]. [...] A handshake signal can be provided to indicate when the [...]. The man skilled in the art will appreciate that [...] warning flags may [...].

Since the architecture can be [...], it is advantageous to be able to rename the burst buffers [...]. This allows the processor to switch [...] would be used for the computation [...] have completed, the buffers are renamed (swapped) and the process proceeds again. To do this, the BAT table can be extended to include 3 additional register fields.

Original fields: buffer_start_address, buffer_size.
New fields: buffer_offset_A, buffer_offset_B, Select_bit.

[...] selected. If an instruction is issued which references this BAT slot, then immediately after it has been written [...] automatically inverted (or toggled) by the burst buffer controller [...] offset_X address [...] is not selected by the select_bit.



I.e.

    Select_bit = NOT Select_bit;

    if (Select_bit == 1) {
        buffer_start_address = buffer_offset_A
        DMA_buffer_address = buffer_offset_B
    }
    else if (Select_bit == 0) {
        buffer_start_address = buffer_offset_B
        DMA_buffer_address = buffer_offset_A
    }

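The renaming step set out in the pseudocode above may be sketched in C as follows. The structure name bat_slot and the function name bat_rename are hypothetical and are introduced purely for illustration; the field names follow those listed for the extended BAT, and the sketch assumes that the toggle and the two assignments take place together as a single step.

```c
#include <assert.h>
#include <stdint.h>

/* One extended BAT slot (structure and function names are illustrative). */
typedef struct {
    uint32_t buffer_offset_A;
    uint32_t buffer_offset_B;
    unsigned select_bit;           /* 0 or 1 */
    uint32_t buffer_start_address; /* buffer visible to the processor */
    uint32_t dma_buffer_address;   /* buffer targeted by the DMA transfer */
} bat_slot;

/* Toggle select_bit, then assign the two buffer roles accordingly. */
void bat_rename(bat_slot *s)
{
    assert(s->select_bit <= 1u);
    s->select_bit ^= 1u;
    if (s->select_bit == 1u) {
        s->buffer_start_address = s->buffer_offset_A;
        s->dma_buffer_address = s->buffer_offset_B;
    } else {
        s->buffer_start_address = s->buffer_offset_B;
        s->dma_buffer_address = s->buffer_offset_A;
    }
}
```

Calling bat_rename twice on the same slot returns the two buffers to their original roles, which is the alternating behaviour required for double buffering.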


The processing of burst instructions in the second embodiment of the architecture of the invention is discussed below with reference to Figure 6. [...] indicated above, this contains an index to an entry in the [...]. [...] decoupling is necessary to achieve the performance [...] on data obtained earlier. The [...] memory before it is required by the processor. The processor passes instructions to the queue in a single cycle. Burst instructions can thus be described as "non-blocking": they do not [...] completion (however, in some embodiments the processor can be [...] when a new burst instruction is issued until room for the new burst instruction [...]). [...] queue 30 is read by the DMA controller 56 which, when ready to access the [...] of the burst instructions (i.e. when the burst buffer interface has priority), reads the [...] in the queue and begins executing that instruction.

As shown in Figure 6, this arrangement leads to four distinct phases in instruction execution. Firstly, there is the "pending" phase directly after the instruction [...] is resident in the burst instruction queue 30. When the instruction has passed down the queue and [...] is performed: this is the "transfer" phase. After the transfer phase [...] contents of the burst buffer in [...] stage.

Issuing of the instruction defines [...] a physical buffer and a region of main memory, this binding [...] run-time is central to the flexibility [...]

[...] methods available for performing synchronisation using sync:

1. issue sync instruction and read the sync register. The read is blocked until all burst instructions in the queue [...]

2. [...] all [...] in the queue have completed and the sync instruction emerges.

3. issue sync instruction and poll the lastcmd register for the sync instruction.

Methods 1 and 2 block the processor, whereas method 3 does not. [...] embodiment, [...]

The burst buffer architecture identified is particularly [...] in media [...]. Simple operations repeated on large arrays of data - such loops [...].

Relevant loops can be identified by hand [...] (see, for example, "Compiler Transformations [...]" [...] Oliver J. Sharp, Technical Report No. UCB/CSD-93-781, [...]) [...] necessary for the loops to be translated correctly into [...] form [...].

The identified code is in the form of a loop [...] needs to be "unrolled" into a series of chunks, [...] architecture in terms of a series of burst loads and stores. [...] it is necessary to allocate buffers before loadbursts are issued, and also before a buffer is used as a target for computation. [...] the loads and [...] may be performed: these may be on the unrolled version of the identified loops, or [...] (larger than a burst) on the loops themselves. Immediately after computation, storebursts may be used to store the [...] buffers and the input buffers freed. Once the storebursts are completed, the output buffers can also be freed.
For example, the following code:

    for(i=0;i<imax;i++){
        x[i] = f(a[i],b[i],c[i]);
        y[i] = g(a[i],b[i],c[i]);
    }

containing three input streams, a[i], b[i] and c[i], and two output streams, x[i] and y[i], [...] be very large and typically greater than the burst size, transforms as follows.

    /* do main body of loop using full bursts only */
    for(i=0;i+burst_size<=imax;i+=burst_size) {
        alloc A;                        /* alloc buffer A for stream a[i] */
        loadburst A(i,burst_size,1);    /* start, length, stride */
        alloc B;                        /* alloc buffer B for stream b[i] */
        loadburst B(i,burst_size,1);    /* start, length, stride */
        alloc C;                        /* alloc buffer C for stream c[i] */
        loadburst C(i,burst_size,1);    /* start, length, stride */
        alloc X;                        /* alloc buffer for compute of stream x[i] */
        alloc Y;                        /* alloc buffer for compute of stream y[i] */

        for(j=0;j<burst_size;j++){
            x[i+j]=f(a[i+j],b[i+j],c[i+j]); /* perform computes - references hit in buffers */
        }
        for(j=0;j<burst_size;j++){
            y[i+j]=g(a[i+j],b[i+j],c[i+j]);
        }

        free A;                         /* free input buffer as compute has completed */
        free B;                         /* free input buffer as compute has completed */
        free C;                         /* free input buffer as compute has completed */

        storeburst X(i,burst_size,1);   /* start, length, stride */
        storeburst Y(i,burst_size,1);   /* start, length, stride */

        free X;                         /* free output buffer after store completed */
        free Y;                         /* free output buffer after store completed */
    }

    /* do tail (when imax%burst_size != 0), starting at the first element */
    /* not covered by a full burst */
    for(i=(imax/burst_size)*burst_size;i<imax;i++){
        x[i]=f(a[i],b[i],c[i]);
        y[i]=g(a[i],b[i],c[i]);
    }
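
The transformation may be exercised on a conventional processor by the following sketch, in which the burst buffers are modelled as small local arrays and loadburst and storeburst are modelled as unit-stride memory copies. BURST_SIZE, the bodies chosen for f and g, and the function name burst_loop are assumptions made for illustration only; buffer allocation and freeing are implicit in the local arrays, and no overlap of transfer with computation is modelled.

```c
#include <assert.h>
#include <string.h>

#define BURST_SIZE 8 /* hypothetical burst length, in elements */

/* Placeholder computations standing in for f and g. */
int f(int a, int b, int c) { return a + b + c; }
int g(int a, int b, int c) { return a * b - c; }

/* Stand-in for loadburst: unit-stride copy from memory into a buffer. */
void loadburst(int *buf, const int *mem, int start, int len)
{
    memcpy(buf, mem + start, (size_t)len * sizeof *buf);
}

/* Stand-in for storeburst: unit-stride copy from a buffer back to memory. */
void storeburst(int *mem, const int *buf, int start, int len)
{
    memcpy(mem + start, buf, (size_t)len * sizeof *buf);
}

void burst_loop(const int *a, const int *b, const int *c,
                int *x, int *y, int imax)
{
    int A[BURST_SIZE], B[BURST_SIZE], C[BURST_SIZE]; /* input buffers */
    int X[BURST_SIZE], Y[BURST_SIZE];                /* output buffers */
    int i, j;

    assert(imax >= 0);

    /* main body: whole bursts only */
    for (i = 0; i + BURST_SIZE <= imax; i += BURST_SIZE) {
        loadburst(A, a, i, BURST_SIZE);
        loadburst(B, b, i, BURST_SIZE);
        loadburst(C, c, i, BURST_SIZE);
        for (j = 0; j < BURST_SIZE; j++)
            X[j] = f(A[j], B[j], C[j]); /* all references hit in buffers */
        for (j = 0; j < BURST_SIZE; j++)
            Y[j] = g(A[j], B[j], C[j]);
        storeburst(x, X, i, BURST_SIZE);
        storeburst(y, Y, i, BURST_SIZE);
    }

    /* tail: remaining elements when imax is not a multiple of BURST_SIZE */
    for (; i < imax; i++) {
        x[i] = f(a[i], b[i], c[i]);
        y[i] = g(a[i], b[i], c[i]);
    }
}
```

The results are the same as those of the original loop for any non-negative imax, including values that are not a multiple of BURST_SIZE, since the tail loop resumes at the index where the last whole burst ended.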

[...] computation targeted at buffer 04 depends [...] identifiers. [...] output buffer 04 itself. The numbers in parentheses [...].

The code as transformed here provides a simple, but inefficient, schedule. This is because the available parallelism has not been exploited (i.e. when the code is issuing bursts, computations are not being performed and vice versa). Analysis of the [...] scheduling program which produces better schedules, with the [...] use of additional burst buffers. The minimum number of burst buffers is one for each input [...] for performing a computation [...] computation). Through [...] order to [...] the DRAM memory bandwidth as efficiently as possible. This differs from normal [...], which is directed to maximise computation efficiency: the different approach arises as the [...] under consideration [...].

In addition to managing the buffer allocation the [...] potential bursts, [...] of the objectives of the compiler is [...] pipelining) may be applied. Once the loops have been [...]. This latter process requires that [...] in the partially unrolled loop at particular points. For example, a loop computation may [...] is cyclically mapped from physical buffer [...] to memory whilst the other buffer is being [...].

A function that must be performed by the compiler [...] with special provision being taken, if [...].

Typical simple examples of the use of the burst buffer system 24 will now be described. First, it may be used as a local data buffer, wherein a region of memory [...]. This can be done in two ways: (1) if the data is uninitialised, then [...] used to get a buffer 40 and perform the address mapping; and (2) if the data is initialised in main memory, then a loadburst command must be used to copy the data into the burst buffer 40. Once [...] may continue to access the same addresses and they will be captured by [...]. [...] the system needs to be made coherent, a storebuffer command is used.

[...] burst buffer [...] used as a look-up table. In much the same way as the local data buffer the burst [...] command. References to addresses that hit in the buffer [...] data. The table size is limited, although there is no reason why a bigger table could not [...] held in the buffer 40 is the most frequently used, [...] improve performance. Coherency is not an issue in this case. Therefore, once the use of the table has been completed, [...] command should be issued.

A possible restriction on the system described [...] not be able to be used as simple FIFOs. Addresses must always be used to access data [...]. [...] be possible to write a software [...] count may be used to emulate [...] burst buffer memory 26, processed by the processor [...] the processing being performed by [...].

[...] appreciated that many other modifications may be made to the embodiments of the invention described above and within the scope of the invention.
Claims

1. A computer system comprising:

[...]

a memory access controller (16) for controlling access to the memory; and

at least one data buffer (40) for buffering data to be written to or read from the memory;

characterised by: [...] deferral parameter, the memory access controller being [...] access controller in dependence upon the deferral parameter.

[...] in claim 1 or 2, wherein each such burst instruction includes or is associated with a [...]

4. A computer system comprising:

[...]

characterised by: [...] each such burst instruction including [...] spacing parameter, and the data buffer in a single memory transaction.

[...] to the data buffer, and if so to access the mapped location in the data buffer.

6. A computer system as claimed in any of claims 1 to 4, wherein said means for issuing burst instructions to the memory access controller comprises:

a memory access table (65) for description of transactions to the memory (14); and

a buffer access table (66) for description of transactions to the at least one data buffer (40); wherein

each burst instruction issued indexes both said memory access table (65) and said buffer access table (66).

[...] and the at least one data buffer (40).

10. A computer system as claimed in any preceding claim, wherein the number of such data buffers is configurable by the system under hardware or software control.

[...] accessed by the memory (14).

12. A computer system as claimed in claim 11, wherein the dual port memory (26) can be accessed at the same time by the processing system (10, 12) and the memory (14).

[...] data buffer.

15. A method of operating a computer system as claimed in any preceding claim, comprising:

identifying in source code computational elements suitable for compilation to, and execution with assistance of, the at least one data buffer (40);

transforming the identified computational elements in the source code to a series of operations each involving a memory transaction no larger than the size of the at least one data buffer (40), and expressing such operations as burst instructions;

executing the source code by the processing system (10, 12), wherein the identified computational elements are processed by the processing system (10, 12) through accesses to the at least one data buffer (40).

16. A method of operating a computer system as claimed in claim 15, wherein data required by an identified computational element is fetched from memory (14) to the at least one data buffer (40) before it is required by the processing system (10, 12).

17. A method of operating a computer system as claimed in claim 15 or claim 16, wherein means are provided to stall the processing system (10, 12) until a transaction between memory (14) and the at least one data buffer (40) is completed.